{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Text Classification using TensorFlow/Keras on Cloud ML Engine </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Creating datasets for Machine Learning using BigQuery\n",
    "<li> Creating a text classification model using the Estimator API with a Keras model\n",
    "<li> Training on Cloud ML Engine\n",
    "<li> Deploying the model\n",
    "<li> Predicting with model\n",
    "<li> Rerun with pre-trained embedding\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'\n",
    "PROJECT = 'cloud-training-demos'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION\n",
    "os.environ['TFVERSION'] = '1.8'\n",
    "\n",
    "if 'COLAB_GPU' in os.environ:  # this is always set on Colab, the value is 0 or 1 depending on whether a GPU is attached\n",
    "  from google.colab import auth\n",
    "  auth.authenticate_user()\n",
    "  # download \"sidecar files\" since on Colab, this notebook will be on Drive\n",
    "  !rm -rf txtclsmodel\n",
    "  !git clone --depth 1 https://github.com/GoogleCloudPlatform/training-data-analyst\n",
    "  !mv  training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/ .\n",
    "  !rm -rf training-data-analyst\n",
    "  # downgrade TensorFlow to the version this notebook has been tested with\n",
    "  !pip install --upgrade tensorflow==$TFVERSION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub. \n",
    "\n",
    "We will use [hacker news](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech related headlines from various  sources."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating Dataset from BigQuery \n",
    "\n",
    "Hacker news headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the sites inception in October 2006 until October 2015. \n",
    "\n",
    "Here is a sample of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT\n",
    "  url, title, score\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  LENGTH(title) > 10\n",
    "  AND score > 10\n",
    "  AND LENGTH(url) > 0\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with <i>nytimes</i>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT\n",
    "  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "  COUNT(title) AS num_articles\n",
    "FROM\n",
    "  `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "  AND LENGTH(title) > 10\n",
    "GROUP BY\n",
    "  source\n",
    "ORDER BY num_articles DESC\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery\n",
    "bq = bigquery.Client(project=PROJECT)\n",
    "\n",
    "query=\"\"\"\n",
    "SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM\n",
    "  (SELECT\n",
    "    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "    title\n",
    "  FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "  WHERE\n",
    "    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "    AND LENGTH(title) > 10\n",
    "  )\n",
    "WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')\n",
    "\"\"\"\n",
    "\n",
    "df = bq.query(query + \" LIMIT 5\").to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  \n",
    "\n",
    "A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "traindf = bq.query(query + \" AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0\").to_dataframe()\n",
    "evaldf  = bq.query(query + \" AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0\").to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below we can see that roughly 75% of the data is used for training, and 25% for evaluation. \n",
    "\n",
    "We can also see that within each dataset, the classes are roughly balanced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "traindf['source'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaldf['source'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we will save our data, which is currently in-memory, to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, shutil\n",
    "DATADIR='data/txtcls'\n",
    "shutil.rmtree(DATADIR, ignore_errors=True)\n",
    "os.makedirs(DATADIR)\n",
    "traindf.to_csv( os.path.join(DATADIR,'train.tsv'), header=False, index=False, encoding='utf-8', sep='\\t')\n",
    "evaldf.to_csv( os.path.join(DATADIR,'eval.tsv'), header=False, index=False, encoding='utf-8', sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -3 data/txtcls/train.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wc -l data/txtcls/*.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TensorFlow/Keras Code\n",
    "\n",
    "Please explore the code in this <a href=\"txtclsmodel/trainer\">directory</a>: `model.py` contains the TensorFlow model and `task.py` parses command line arguments and launches off the training job.\n",
    "\n",
    "In particular look for the following:\n",
    "\n",
    "1. [tf.keras.preprocessing.text.Tokenizer.fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) to generate a mapping from our word vocabulary to integers\n",
    "2. [tf.keras.preprocessing.text.Tokenizer.texts_to_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences) to encode our sentences into a sequence of their respective word-integers\n",
    "3. [tf.keras.preprocessing.sequence.pad_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) to pad all sequences to be the same length\n",
    "\n",
    "The embedding layer in the keras model takes care of one-hot encoding these integers and learning a dense emedding represetation from them. \n",
    "\n",
    "Finally we pass the embedded text representation through a CNN model pictured below\n",
    "\n",
    "<img src=images/txtcls_model.png  width=25%>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run Locally (optional step)\n",
    "Let's make sure the code compiles by running locally for a fraction of an epoch.\n",
    "This may not work if you don't have all the packages installed locally for gcloud (such as in Colab).\n",
    "This is an optional step; move on to training on the cloud."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "source activate py2env  # gcloud uses python2 by default\n",
    "## Make sure we have the latest version of Google Cloud Storage package\n",
    "pip install --upgrade google-cloud-storage\n",
    "rm -rf txtcls_trained\n",
    "gcloud ml-engine local train \\\n",
    "   --module-name=trainer.task \\\n",
    "   --package-path=${PWD}/txtclsmodel/trainer \\\n",
    "   -- \\\n",
    "   --output_dir=${PWD}/txtcls_trained \\\n",
    "   --train_data_path=${PWD}/data/txtcls/train.tsv \\\n",
    "   --eval_data_path=${PWD}/data/txtcls/eval.tsv \\\n",
    "   --num_epochs=0.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train on the Cloud\n",
    "\n",
    "Let's first copy our training data to the cloud:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud storage cp data/txtcls/*.tsv gs://${BUCKET}/txtcls/"   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/txtcls/trained_fromscratch\n",
    "JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    " --region=$REGION \\\n",
    " --module-name=trainer.task \\\n",
    " --package-path=${PWD}/txtclsmodel/trainer \\\n",
    " --job-dir=$OUTDIR \\\n",
    " --scale-tier=BASIC_GPU \\\n",
    " --runtime-version=$TFVERSION \\\n",
    " -- \\\n",
    " --output_dir=$OUTDIR \\\n",
    " --train_data_path=gs://${BUCKET}/txtcls/train.tsv \\\n",
    " --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \\\n",
    " --num_epochs=5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the job name appropriately. View the job in the console, and wait until the job is complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud ml-engine jobs describe txtcls_190209_224828"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Monitor training with TensorBoard\n",
    "\n",
    "If Tensorboard appears blank, try refreshing after a few minutes.\n",
    "If you removed the directory that Tensorboard was monitoring, stop and restart Tensorboard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.datalab.ml import TensorBoard\n",
    "TensorBoard().start('gs://{}/txtcls/trained_fromscratch'.format(BUCKET))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for pid in TensorBoard.list()['pid']:\n",
    "  TensorBoard().stop(pid)\n",
    "  print('Stopped TensorBoard with pid {}'.format(pid))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "What accuracy did you get? You should see around 80%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy trained model \n",
    "\n",
    "Once your training completes you will see your exported models in the output directory specified in Google Cloud Storage. \n",
    "\n",
    "You should see one model for each training checkpoint (default is every 1000 steps)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud storage ls gs://${BUCKET}/txtcls/trained_fromscratch/export/exporter/"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will take the last export and deploy it as a REST API using Google Cloud Machine Learning Engine "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "MODEL_NAME=\"txtcls\"\n",
    "MODEL_VERSION=\"v1_fromscratch\"\n",
    "MODEL_LOCATION=$(gcloud storage ls gs://${BUCKET}/txtcls/trained_fromscratch/export/exporter/ | tail -1)\n",    "#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet\n",
    "#gcloud ml-engine models delete ${MODEL_NAME}\n",
    "gcloud ml-engine models create ${MODEL_NAME} --regions $REGION\n",
    "gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Predictions\n",
    "\n",
    "Here are some actual hacker news headlines gathered from July 2018. These titles were not part of the training or evaluation datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "techcrunch=[\n",
    "  'Uber shuts down self-driving trucks unit',\n",
    "  'Grover raises €37M Series A to offer latest tech products as a subscription',\n",
    "  'Tech companies can now bid on the Pentagon’s $10B cloud contract'\n",
    "]\n",
    "nytimes=[\n",
    "  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',\n",
    "  'A $3B Plan to Turn Hoover Dam into a Giant Battery',\n",
    "  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'\n",
    "]\n",
    "github=[\n",
    "  'Show HN: Moon – 3kb JavaScript UI compiler',\n",
    "  'Show HN: Hello, a CLI tool for managing social media',\n",
    "  'Firefox Nightly added support for time-travel debugging'\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our serving input function expects the already tokenized representations of the headlines, so we do that pre-processing in the code before calling the REST API.\n",
    "\n",
    "Note: Ideally we would do these transformation in the tensorflow graph directly instead of relying on separate client pre-processing code (see: [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml/#training_serving_skew)), howevever the pre-processing functions we're using are python functions so cannot be embedded in a tensorflow graph. \n",
    "\n",
    "See the <a href=\"text_classification_native.ipynb\">text_classification_native</a> notebook for a solution to this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "from tensorflow.python.keras.preprocessing import sequence\n",
    "from googleapiclient import discovery\n",
    "from oauth2client.client import GoogleCredentials\n",
    "import json\n",
    "\n",
    "requests = techcrunch+nytimes+github\n",
    "\n",
    "# Tokenize and pad sentences using same mapping used in the deployed model\n",
    "tokenizer = pickle.load( open( \"txtclsmodel/tokenizer.pickled\", \"rb\" ) )\n",
    "\n",
    "requests_tokenized = tokenizer.texts_to_sequences(requests)\n",
    "requests_tokenized = sequence.pad_sequences(requests_tokenized,maxlen=50)\n",
    "\n",
    "# JSON format the requests\n",
    "request_data = {'instances':requests_tokenized.tolist()}\n",
    "\n",
    "# Authenticate and call CMLE prediction API \n",
    "credentials = GoogleCredentials.get_application_default()\n",
    "api = discovery.build('ml', 'v1', credentials=credentials,\n",
    "            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')\n",
    "\n",
    "parent = 'projects/%s/models/%s' % (PROJECT, 'txtcls') #version is not specified so uses default\n",
    "response = api.projects().predict(body=request_data, name=parent).execute()\n",
    "\n",
    "# Format and print response\n",
    "for i in range(len(requests)):\n",
    "  print('\\n{}'.format(requests[i]))\n",
    "  print(' github    : {}'.format(response['predictions'][i]['dense_1'][0]))\n",
    "  print(' nytimes   : {}'.format(response['predictions'][i]['dense_1'][1]))\n",
    "  print(' techcrunch: {}'.format(response['predictions'][i]['dense_1'][2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many of your predictions were correct?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rerun with Pre-trained Embedding\n",
    "\n",
    "In the previous model we trained our word embedding from scratch. Often times we get better performance and/or converge faster by leveraging a pre-trained embedding. This is a similar concept to transfer learning during image classification.\n",
    "\n",
    "We will use the popular GloVe embedding which is trained on Wikipedia as well as various news sources like the New York Times.\n",
    "\n",
    "You can read more about Glove at the project homepage: https://nlp.stanford.edu/projects/glove/\n",
    "\n",
    "You can download the embedding files directly from the stanford.edu site, but we've rehosted it in a GCS bucket for faster download speed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Copying gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt [Content-Type=text/plain]...\n",
      "- [1 files][661.3 MiB/661.3 MiB]      0.0 B/s                                   \n",
      "Operation completed over 1 objects/661.3 MiB.                                    \n"
     ]
    }
   ],
   "source": [
    "!gcloud storage cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt gs://$BUCKET/txtcls/"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the embedding is downloaded re-run your cloud training job with the added command line argument: \n",
    "\n",
    "` --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt`\n",
    "\n",
    "Be sure to change your OUTDIR so it doesn't overwrite the previous model.\n",
    "\n",
    "While the final accuracy may not change significantly, you should notice the model is able to converge to it much more quickly because it no longer has to learn an embedding from scratch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### References\n",
    "- This implementation is based on code from: https://github.com/google/eng-edu/tree/master/ml/guides/text_classification.\n",
    "- See the full text classification tutorial at: https://developers.google.com/machine-learning/guides/text-classification/\n",
    "-\n",
    "\n",
    "## Next step\n",
    "Client-side tokenizing in Python is hugely problematic. See <a href=\"text_classification_native.ipynb\">Text classification with native serving</a> for how to carry out the preprocessing in the serving function itself."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
